Fix binary search #3
base: master
Conversation
This was sort of intentional; we only care about being close enough to the target. Because of the invariant described in the inline comments, the algorithm is correct. The "fix" would be to use the "high" quantization rather than "mid"; of course you then need to recompute the results with that value.
It is not clear that your fix is correct.
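For reference, a minimal sketch of the change suggested above: keep the existing loop, but recompute the result at `high` once the search ends. `quantize_array` is the module's own function from the diff below; the `A_norms` definition and the return value here are assumptions, not taken from the PR.

```python
import numpy as np

def quantize_array_target_high(A, target_err):
    # Assumed: per-row norms used to normalize the reconstruction error.
    A_norms = np.sqrt(np.sum(A ** 2, axis=-1))
    low, high = 1, 128
    # Invariant (see the discussion below): error(high) <= target_err < error(low).
    while high - low > 1:
        mid = (high + low) // 2
        quant_A, dequant_A = quantize_array(A, mid)
        mean_err = np.mean(np.sqrt(np.sum((dequant_A - A) ** 2, axis=-1)) / A_norms)
        if mean_err > target_err:
            low = mid
        else:
            high = mid
    # Recompute at `high`, the smallest level whose error meets the target,
    # rather than returning whatever the last `mid` happened to be.
    return quantize_array(A, high)
```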
```diff
@@ -64,19 +64,23 @@ def quantize_array_lbda(A, lbda):
 def quantize_array_target(A, target_err):
     low = 1
     high = 128
+    mid = low
```
You don't need this
```diff
-    while high - low > 1:
-        mid = (high + low) / 2
+    while low < high:
+        mid = low + (high - low) / 2
```
Python has arbitrary precision integers, no need to do this
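To illustrate the point: Python integers have arbitrary precision, so the overflow-avoiding form buys nothing over the plain midpoint here.

```python
# In C/Java, low + high can overflow a fixed-width int, which is why the
# low + (high - low) / 2 idiom exists. Python ints never overflow:
low, high = 1, 2 ** 100
assert (low + high) // 2 == low + (high - low) // 2
```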
```diff
         quant_A, dequant_A = quantize_array(A, mid)
         mean_err = np.mean(np.sqrt(np.sum((dequant_A - A)**2, axis=-1)) / A_norms)
         logging.info("Binary search: q=%d, err=%.3f", mid, mean_err)
```
```diff
         if mean_err > target_err:
-            low = mid
+            low = mid + 1
```
This breaks the loop invariant that error(high) <= target_err < error(low)
```diff
         else:
             high = mid
```
```diff
+    mid = low
+    quant_A, dequant_A = quantize_array(A, mid)
```
This is copypasta
Yeah, I understand that the result is always near the error boundary; I just want to make sure the code expresses the intention clearly. And no, the algorithm is not right, simply returning

This error comes from the fact that you're using

I wrote an example test to show that my algorithm is right (I can try to prove it later), and the current code has problems: https://gist.github.com/sleepsort/708de5484d8484471ed33c4230790d39
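(The gist itself is not reproduced here; the following is only a minimal sketch of the kind of check such a test can make, assuming a `quantize_array_target` that returns the quantized/dequantized pair and a target that is reachable at some level up to 128.)

```python
import numpy as np

def check_target(A, target_err):
    # Run the search, then independently recompute the error of whatever it
    # returned; a correct search should never hand back an error above the
    # target (given the target is reachable at the finest level).
    quant_A, dequant_A = quantize_array_target(A, target_err)  # assumed return shape
    A_norms = np.sqrt(np.sum(A ** 2, axis=-1))
    mean_err = np.mean(np.sqrt(np.sum((dequant_A - A) ** 2, axis=-1)) / A_norms)
    assert mean_err <= target_err, (mean_err, target_err)

check_target(np.random.randn(1000, 64).astype(np.float32), target_err=0.1)
```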
That doesn't satisfy the invariant I mentioned in the comment. What is the invariant of your algorithm?
Ah I see, sorry, I didn't understand the term "invariant".

Just in theory, it is possible that we miss a better compression method by returning index 1 instead of 0. And this example breaks the invariant as well. I might need some time to figure out the invariant of my algorithm... My intention is to make sure that error(low) > target_error will never happen, unless every number in the array has error(i) > target_error, which already meets the exit condition. Can that be considered an invariant?
That still doesn't satisfy the invariant (in this case, that error(low) > target). I'm not sure you're familiar with standard algorithm correctness proofs; see for example https://en.wikipedia.org/wiki/Loop_invariant. In the algorithm as it is now, we have the loop invariant error(high) <= target_err < error(low), and the postcondition high == low + 1 when the loop exits. From the postcondition you can derive that low is the largest integer with error > target, and high is the smallest integer with error <= target. So if you want the best compression such that the error is <= target, just use high. So, why do you want to change the binary search implementation?
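As an illustration of that argument (not the actual code), here is the original search with the precondition, invariant, and postcondition written out as asserts; `error(q)` stands for the mean relative error after quantizing at level q and is assumed to be non-increasing in q.

```python
def search_smallest_ok(error, target_err, low=1, high=128):
    # Precondition: error(low) > target_err and error(high) <= target_err.
    assert error(low) > target_err >= error(high)
    while high - low > 1:
        # Loop invariant: error(high) <= target_err < error(low).
        assert error(high) <= target_err < error(low)
        mid = (low + high) // 2
        if error(mid) > target_err:
            low = mid
        else:
            high = mid
    # Postcondition: the interval has shrunk to width 1, so `low` is the
    # largest level with error > target_err and `high` is the smallest level
    # with error <= target_err (assuming error is non-increasing in q).
    assert high == low + 1
    return high
```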
Thanks for the link, I'll check it later. Yes, your algorithm is right if the invariant is true. My point is that we cannot always assert that the real data meets this invariant, for example edge cases such as error(i) == error(j), or error(low) == target. The code is vulnerable to those cases, which might happen in production. And my initial intention was to make this binary search more general.
This is not an edge case, and the algorithm still works. About the preconditions: you can check them before running the search and decide what to do if they are not met. In this case it means that the initial range is not wide enough, and we should just stop the algorithm rather than trying to return an endpoint of the range.
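A rough sketch of such an up-front check (the helper names and the error handling are illustrative, not from the PR; `quantize_array` is the function from the diff):

```python
import numpy as np

def mean_error(A, q):
    # Hypothetical helper: mean relative reconstruction error at level q.
    quant_A, dequant_A = quantize_array(A, q)
    A_norms = np.sqrt(np.sum(A ** 2, axis=-1))
    return np.mean(np.sqrt(np.sum((dequant_A - A) ** 2, axis=-1)) / A_norms)

def check_preconditions(A, target_err, low=1, high=128):
    # The search assumes error(low) > target_err >= error(high).
    if mean_error(A, high) > target_err:
        # Even the finest level misses the target: stop instead of returning
        # an endpoint of the range.
        raise ValueError("target_err not reachable even at q=%d" % high)
    if mean_error(A, low) <= target_err:
        # The coarsest level is already good enough: no search needed.
        return low
    return None  # preconditions hold; run the binary search as usual
```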
Oh, to clarify, I meant it breaks the precondition rather than the invariant. And I understand the difference in our preferences : ). My preference is for an implementation that handles all possible corner cases (my understanding of

As I understand it, to implement your algorithm so that it is error-immune, the code would be like:
And my algorithm can remove [1] and [2], and report the exception of [2] when [4] is calculated, no matter what the initial setting of low and high is.

About the invariant in my algorithm, how about this? The first assumption is trivial, and exception handling for the second assumption can be done after the for loop, so the code looks like:
And to explain the loop in my code:
Basically, we cannot reuse the exit state (quant_A, mid) as the final result: as the log below shows, the current logic will produce a quantized vector with mean_err > target_error. Since there are only log(N) candidate nodes in the search tree, it would be simpler / easier to just recalculate it in a last pass.
Log before fix:
https://gist.github.com/sleepsort/1d0d1582825c6ea248eca19a0a471d1f
Log after fix:
https://gist.github.com/sleepsort/64ac49af24dc98eefa3ea8c4575ab3b6
To reproduce:
https://gist.github.com/sleepsort/0585a47f42d200b1d445201e3d928474
./model_quantization.py quant test.w2v.txt /tmp/test.w2v